Machine Learning - Reddit Data

Conclusion

The results show that Naive Bayes outperformed Logistic Regression at classifying Reddit comments as cancer-related or non-cancer, achieving higher accuracy, precision, and recall. This makes it the better fit for identifying cancer-related content. Its probabilistic formulation and its assumption that words are conditionally independent given the class tend to work well on text data.

Logistic Regression, though flexible, produced more false negatives, meaning it missed more comments that were actually cancer-related, which makes it less suitable for this task. Overall, Naive Bayes is the preferred choice here for its simplicity, efficiency, and stronger performance.
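
To make the role of false negatives concrete, the following is a minimal sketch of how accuracy, precision, and recall follow from the confusion counts. It assumes that label 1 means "cancer-related". The names `BinaryMetrics` and `evaluate` and the toy label lists are hypothetical illustrations, not the project's code or its results.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    """Confusion counts for a binary classifier (hypothetical helper)."""
    tp: int  # cancer-related comments correctly flagged
    fp: int  # non-cancer comments wrongly flagged
    fn: int  # cancer-related comments that were missed
    tn: int  # non-cancer comments correctly left alone

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0


def evaluate(y_true, y_pred, positive=1) -> BinaryMetrics:
    """Count TP/FP/FN/TN by comparing true labels with predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return BinaryMetrics(tp, fp, fn, tn)


# Toy data only (assumption): 1 = cancer-related, 0 = non-cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
preds_a = [1, 1, 1, 0, 0, 0, 0, 1]  # one missed positive, one false alarm
preds_b = [1, 1, 0, 0, 0, 0, 0, 0]  # two missed positives, no false alarms

metrics_a = evaluate(y_true, preds_a)
metrics_b = evaluate(y_true, preds_b)

for name, m in (("A", metrics_a), ("B", metrics_b)):
    print(f"{name}: acc={m.accuracy:.2f} prec={m.precision:.2f} "
          f"rec={m.recall:.2f} fn={m.fn}")
```

In this toy setup both sets of predictions have the same accuracy, but the one with more false negatives has lower recall. This is why a higher false-negative count matters when the goal is to avoid missing cancer-related comments.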

Reflection

Understanding whether a comment is cancer-related is crucial for enhancing online support systems and health communication. Cancer-related subreddits often serve as platforms for individuals seeking emotional support, sharing personal experiences, or discussing treatment options. Automatically identifying such comments can help moderators curate relevant content, ensure timely responses to critical queries, and provide researchers with valuable insights into public sentiment, challenges, and trends related to cancer. This understanding is particularly important for designing interventions, improving healthcare accessibility, and fostering a supportive community for those affected by cancer.